HW 4 - Supervised Learning at Scale.

MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Spring 2023

Updated: 06/18/2023

HW4 is composed of two sections (see module 6 for access to the HW4 sections):

This notebook and corresponding Digital Campus quiz forms part 2 of 2 for HW4.

In the first three HW assignments, you became familiar with the Map-Reduce programming paradigm as manifested in the Hadoop Streaming and Spark frameworks. We explored how different data structures and design patterns can help us manage the computational complexity of an algorithm. As part of this process you implemented both a supervised learning algorithm (Naive Bayes) and an unsupervised learning algorithm (synonym detection via cosine similarity). In both of these tasks, parallelization helped us manage calculations involving a large number of features. However, a large feature space isn't the only situation that might prompt us to parallelize a machine learning algorithm. In the final two assignments we'll look at cases where the iterative nature of an algorithm is the main driver of its computational complexity (and the reason we might want to parallelize it).

In this assignment, you will learn linear models using different loss functions: OLS, RidgeMSE, LassoMSE, BXE, CXE, RidgeBXE, LASSOBXE, etc. As in previous assignments, you will implement the core calculations using Spark RDDs, and you will also start working with Spark DataFrames. We've provided more of a code base than before, since the focus of the latter half of the course is more on general machine learning concepts. By the end of this homework you should be able to:

Additional Reference: latest (3.2.0) Spark Documentation - RDD programming guide

Please refer to the README for homework submission instructions and additional resources.

Notebook Set-Up

Before starting your homework run the following cells to confirm your setup.

Data and HW output vs notebook Locations: private data bucket versus dataproc staging bucket

When you create a DataProc cluster, HDFS is used as the default filesystem. In this course, we override this behavior by setting the defaultFS to a Cloud Storage bucket. The benefit of using Cloud Storage buckets is that your job results persist beyond the lifetime of the cluster, and latency on these buckets is low.
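As a rough sketch of how this override can be applied, the defaultFS can be set via cluster properties at creation time (the cluster name, region, and bucket name below are placeholders, not the course's actual values):

```sh
# Sketch only: point Hadoop's core-site fs.defaultFS at a GCS bucket
gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --properties 'core:fs.defaultFS=gs://w261-example-bucket'
```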

In this HW, you use your personal cloud bucket (and folders within it), known as DATA_BUCKET, as:

The datasets for this homework will be downloaded and saved into your private Data Bucket on Google Cloud. Recall that you created a private data bucket during the first week of the semester. You may have called it w261-. Jimi's bucket name is w261-jgs. To facilitate ease of access, we have set up location variables for the course-related data buckets. Your private data bucket can be accessed via:

This DATA_BUCKET will be used for hosting the input data for this assignment and also to store output from your HW Assignment apps. Associated with each DataProc cluster is a persistent storage bucket that we refer to as the DataProc Staging Bucket. You will be using this staging bucket to store notebooks and other files associated with your HW assignments and live sessions. The location of the staging bucket is made available via os.getenv("STAGING_BUCKET"). Since this bucket is persistent, we will no longer need to snapshot your student workspaces.

For more background on Dataproc staging buckets please see:

Start Spark and the Spark UI

OPTIONAL: to view the Spark UI (jobs and stages)

Web UI (aka Application UI or webUI or Spark UI) is the web interface of a running Spark application to monitor and inspect Spark job executions in a web browser. The following is a screenshot of the Spark UI. Feel free to skip this section.

To access the Spark UI, you will need to create a tunnel from DataProc via the CloudShell. Please follow all three steps depicted below (starting with running the following code cell):

✅ Question 1: Optimization Theory

As you know from w207, Gradient Descent is an iterative process that seeks to find the optimal parameters for a model given a particular training data set. It does this by using the vector of partial derivatives of a loss function to strategically update parameters in a way that will reduce the loss. In live session 6, you discussed some of the theory behind why gradient descent works and looked at a small example of gradient descent in the context of linear regression.
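The parameter update described above can be written compactly. Using $\eta$ for the learning rate and $J(\theta)$ for the loss function, each iteration of gradient descent performs:

$$ \theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta J\big(\theta^{(t)}\big) $$

Because the gradient points in the direction of steepest increase of the loss, stepping against it reduces the loss (for a sufficiently small $\eta$).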

Q1 Tasks:

Q1 Student Answers:

OPTIONAL - if you'd like to save your answers in this notebook:

image.png

image.png

image.png image.png

image.png

Load data and do EDA

About the Data

For the main task in this portion of the homework you will use data about red and white Portuguese wines. This data was made available to the UC Irvine public repository of Machine Learning datasets by researchers at the University of Minho in association with this paper:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

The dataset includes 12 fields:

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
quality -- (a score between 0 and 10): TARGET Variable

IMPORTANT NOTE: The outcome variable in our data is a human-assigned quality score ranging from 0 to 10. Since the scores are integers, this is actually an ordinal and not a numerical outcome variable. However, for the purposes of this assignment we'll treat it as a numerical quantity.

The data are in two files: one containing red wines and another containing white wines. Use the following cells to download the data, add a field for red/white, and split it into a test and train set.

✅ Question 2: EDA

A statistician's approach to Linear Regression typically involves a series of EDA steps to examine each feature in the data and then a series of steps to test assumptions about their potential contribution to a multi-feature linear model. In particular, we'd want to look for a set of features that exhibit a likely linear relationship with the outcome variable and that are not highly correlated with each other. In the context of machine learning, these considerations remain important techniques for improving model generalizability, despite the common practice of using model evaluation techniques (and large data sets) to get the final word on feature selection.

In this question we'll briefly look at the features in our data set. To mimic an 'at scale' analysis we'll start by sampling from our Spark RDD training set so that we have a manageable amount of data to work with in our visuals.

Here, we detour briefly and explore different methods of normality tests based on graphical methods and statistical methods.

Visual Normality Checks: histograms and Q-Q plot

We can create plots of the data to check whether each feature is Gaussian distributed. These checks are qualitative, and so less accurate than the statistical methods we will calculate in the next section. Nevertheless, they are fast and, like the statistical tests, must still be interpreted before you can make a call about your data sample. A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram. A histogram can be created using the hist() matplotlib function. By default, the number of bins is automatically estimated from the data sample. If a variable's histogram takes a bell shape corresponding to a Gaussian-like distribution, then we can qualitatively say it is a normally distributed variable. Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.
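The quantities behind both plots can be computed directly. The sketch below uses a synthetic sample standing in for one wine feature (the name "pH" and the sample parameters are illustrative, not the real data): histogram bin counts, and Q-Q points pairing each sorted observation with the quantile of a normal distribution fit to the sample. If the sample is roughly Gaussian, the Q-Q points fall near the line y = x.

```python
import random
import statistics
from statistics import NormalDist

# Hypothetical sample standing in for one wine feature (e.g. pH).
random.seed(0)
sample = [random.gauss(3.2, 0.15) for _ in range(200)]

# Histogram counts (what plt.hist() would draw).
lo, hi, bins = min(sample), max(sample), 10
width = (hi - lo) / bins
counts = [0] * bins
for x in sample:
    idx = min(int((x - lo) / width), bins - 1)  # clamp max value into last bin
    counts[idx] += 1

# Q-Q plot points: sorted sample values vs theoretical normal quantiles.
mu, sigma = statistics.mean(sample), statistics.stdev(sample)
ordered = sorted(sample)
n = len(ordered)
theoretical = [NormalDist(mu, sigma).inv_cdf((i + 0.5) / n) for i in range(n)]
qq_points = list(zip(theoretical, ordered))  # near y = x if the sample is ~normal
```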

In this section, we have highlighted two common methods for visually inspecting a dataset to check if it was drawn from a Gaussian distribution.

Statistical Normality Tests

There are many statistical tests that we can use to quantify whether a sample of data looks as though it was drawn from a Gaussian distribution. Each test makes different assumptions and considers different aspects of the data.

We will list commonly used tests that one can apply to one's own data samples:

We explore the Shapiro-Wilk test in detail; please feel free to use any of these tests to check the normality of a variable in a programmatic way. The Shapiro-Wilk test, named for Samuel Shapiro and Martin Wilk, evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution. In practice, the Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that it may only be suitable for smaller samples of data, e.g. thousands of observations or fewer.

The Shapiro-Wilk test is a hypothesis test, and its two hypotheses are as follows:

The scipy.stats.shapiro() SciPy function calculates the Shapiro-Wilk test on a given dataset. The function returns both the W-statistic calculated by the test and the p-value. The complete example of performing the Shapiro-Wilk test on the dataset is given here:
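As a minimal sketch of the call pattern (the sample below is synthetic, standing in for a feature column such as alcohol sampled from the training RDD):

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical stand-in for one sampled feature column (e.g. 'alcohol').
rng = np.random.default_rng(261)
feature_sample = rng.normal(loc=10.5, scale=1.2, size=500)

# shapiro() returns the W-statistic and the p-value.
stat, p_value = shapiro(feature_sample)

# H0: the sample was drawn from a Gaussian; reject H0 when p < alpha.
alpha = 0.05
looks_gaussian = p_value > alpha
```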

Q2 Tasks:

Q2 Student Answers:

OPTIONAL - if you'd like to save your answers in this notebook:

a) Type your answer here!

b) Type your answer here!

c) Type your answer here!

d) Type your answer here!

e) Type your answer here!

f) Type your answer here!

g) Type your answer here!

image.png

✅ Question 3: OLS Loss

For a parametric model, the key factor that will impact how easy it is to optimize is your choice of how to define the loss function. In Ordinary Least Squares (OLS) Regression our loss function is just about as convenient as you will get: not only is it convex, it's also very easy to interpret.
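Written out with bias-augmented feature vectors $x'_i$ (matching the gradient notation used later in this notebook), the OLS loss over $n$ training examples is the mean squared error:

$$ J(\theta) \;=\; \frac{1}{n} \sum_{i=1}^n \left( \theta^T \cdot x'_i - y_i \right)^2 $$

Each term is the squared residual for one record, which is what makes the loss both convex in $\theta$ and easy to interpret.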

When doing supervised learning, a simple sanity check consists of comparing one’s estimator against simple rules of thumb. It is useful as a simple baseline to compare with other (real) regressors. Examples of regression baselines include:

In this question you'll "choose" a baseline model and then write a function to compute the loss of a linear model in Spark. You'll reuse this function in Q4 when you implement gradient descent.
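In the homework itself the loss is computed over a Spark RDD; the sketch below shows the same map-then-average pattern in plain Python on a tiny hypothetical set of (features, label) records, with the training-set mean serving as the baseline model (weights all zero, bias equal to the mean):

```python
# Toy (x, y) records standing in for the wine training RDD.
data = [([1.2, 0.3], 5.0), ([0.8, 0.5], 6.0), ([1.0, 0.4], 5.5), ([1.4, 0.2], 7.0)]

# Mean baseline: predict the training-set mean y for every record.
mean_y = sum(y for _, y in data) / len(data)

def ols_loss(W, b, records):
    """Mean squared error of the linear model y_hat = W . x + b."""
    sq_errors = [
        (sum(w * xj for w, xj in zip(W, x)) + b - y) ** 2  # "map" step per record
        for x, y in records
    ]
    return sum(sq_errors) / len(records)  # "reduce" / mean step

# The baseline's loss equals the variance of y around its mean.
baseline_loss = ols_loss([0.0, 0.0], mean_y, data)
```

Any model you train should beat `baseline_loss`; otherwise it has learned nothing beyond the mean.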

Baseline example illustrated:

Q3 Tasks:

Q3 Student Answers:

OPTIONAL - if you'd like to save your answers in this notebook:

a) Type your answer here!

b) Type your answer here!

c) image.png

d) image.png

f) image.png

g) Type your answer here!

h) image.png

✅ Question 4: Vanilla Gradient Descent

Performing Gradient Descent technically only requires two steps: 1) use the current model to calculate the gradient; 2) use the gradient to update the current model parameters. In practice though, we'll want to add a third step, which is to compute the loss for our new model so that we can see if it's working. In this question you'll implement gradient descent for OLS regression and take a look at a few update steps.
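The three steps above can be sketched in NumPy on synthetic data (the data, learning rate, and step count below are illustrative choices, not the homework's; the homework version distributes the gradient computation over an RDD):

```python
import numpy as np

# Toy data with known relationship y = 2x + 1; append a constant-1
# column so the bias is just another weight (bias-augmented x'_i).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
X = np.hstack([x, np.ones((100, 1))])
y = 2.0 * x[:, 0] + 1.0

theta = np.zeros(2)   # initial model
lr = 0.1              # learning rate
losses = []
for step in range(500):
    residuals = X @ theta - y                     # step 1: current errors
    grad = (2.0 / len(y)) * (X.T @ residuals)     # ... gradient of the MSE loss
    theta = theta - lr * grad                     # step 2: update parameters
    losses.append(np.mean((X @ theta - y) ** 2))  # step 3: track the loss
```

On this noiseless toy problem the weights converge to the generating values (slope 2, bias 1), and the tracked loss decreases monotonically.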

Q4 Tasks:

Q4 Student Answers:

OPTIONAL - if you'd like to save your answers in this notebook:

a) image.png

b)

The formula for the gradient is given by :

$$ \frac{2}{n} \sum_{i=1}^n \left[ \theta^T \cdot x'_i - y_i \right] \cdot x'_i $$

A large gradient and subsequent update would occur under the scenario where we have a very large error (i.e. the average residual is really large), assuming all our features have been normalized (there's a trivial case where, if the feature values are really large, the steps will be large too). If our initial model was really far off from the optimal weights, then it makes sense for the model to take a really large step in the optimal direction.

A small gradient / update would occur under the complement of the previous scenario, where the residual is very small and therefore we are in the vicinity of the optimal solution. A small step then makes sense to arrive at a local/global optimum.


c) image.png

d) image.png

e) complete the coding portions of this question before answering f

f) image.png

g) complete the coding portions of this question before answering h

h) image.png

Question 4F Expected Output